FOREWORD 


This Indian Standard was adopted by the Bureau of Indian Standards after the draft finalized by the Documentation 
and Information Sectional Committee had been approved by the Management and Systems Division Council. 


Accessibility, usability, and interoperability of resources depend on how well these are described. Textual analysis 
helps to describe information resources. Content analysis, an approach to text analysis, involves systematic coding 
and interpretation of information. It involves a set of procedures for data reduction and its subjective interpretation. 
It is an analytical research method that can be applied to any branch of knowledge. Content analysis involves 
categorisation of information, which helps machines to analyse and retrieve it. The approach would be helpful to 
anyone who is interested in analysing literature thematically or subject-wise. 


The composition of the Committee responsible for the formulation of this standard is given in Annex B. 
INTRODUCTION 


In content analysis, the main focus is to classify or interpret text by capturing the meaning and themes 
embedded in it. It is often believed that content analysis involves analysing text based on word count or 
frequency of occurrence of words. This belief rests on the common notion that the words which appear most 
frequently are the key concepts that reflect the main theme. A purely frequency-based analysis has its own 
limitations, as it is based on terms and not concepts. It fails to identify relations and themes, and also fails 
to take into account synonyms, homonyms and homographs. 


There are two approaches to content analysis: quantitative analysis and qualitative analysis. 


a) Quantitative/Conceptual Content Analysis: This approach focuses on the frequency with which themes or 
words appear in text. Its aim is to identify and describe the concepts appearing in the text. The frequency 
criterion ensures the inclusion of words with high frequency and the omission of low-frequency words, but 
it is limited in identifying associations between terms. Here, either existing or previously developed 
coding schemes are used for coding and categorisation. 

b) Qualitative/Relational Content Analysis: Content analysis is not limited to simple word counts or extracting 
objective content from material. This approach goes beyond that and focuses on establishing the range of semantics 
missing in quantitative analysis. Instead of illustrating the statistical significance of concepts, the emphasis is on 
defining the theme, description and meaning embedded in text. As this approach does not rely on statistical 
methods for defining causation or analysing results, its limitations include the difficulty of establishing the 
validity of results and the time demanded by its subjective nature. 

Though the two approaches differ in focus, using them in combination allows each to supplement the other. 
Further, this document is illustrative and presents the procedure. It does not specify the algorithms or computer 
programs meant for content analysis. 


Here, content analysis is considered as an approach to build an understanding of the content by analysing 
it with different tools and techniques. The focus is to analyse or understand the text content on the basis of a set of 
categories identified from the content. Though in the digital world most analysis work is performed 
with the help of computers, the scope of this document does not neglect the role of humans, who are capable 
of mining and contextualising the latent content of scholarly digital communications. 


ROLE OF COMPUTER IN CONTENT ANALYSIS 


Content analysis, when performed manually, may be a labour-intensive, tedious, and time-consuming task. A 
computer-assisted approach, by contrast, reduces effort, helps to overcome bias in content/term selection, improves 
content coverage, and cuts the time involved in performing the task. 


The task of content analysis can be simplified by applying computers to various activities. The computer: 


a) Supports data markup, breaking the text up for analysis, grouping of similar instances, global 
coding and editing. 

b) Helps to match the text against mechanised dictionaries for coding. 

c) Assists in collecting and maintaining data. 

d) Helps in keeping track of the various steps of analysis, which further allows the analysis to be replicated. 

e) Facilitates frequency counts and other calculations through automated statistical packages. 

f) Allows sharing, dissemination, and reuse of coded content. 


Though the mechanised approach has various advantages over the manual method, it still has the following limitations, 
which call for the use of a mixed approach: 


1) It cannot deal with ambiguities related to homographs (words with the same spelling but different meanings), 
metaphors (words/phrases used to represent likeness or analogy between objects), etc. 


2) It processes the data without considering the context in which it appears; for example, it cannot relate pronouns 
to the nouns they refer to in the text. 


3) Text crunching may affect the precision of results. 


Thus, to establish reliability and to validate the results of the mechanised approach, the involvement of human beings is 
essential. 
Indian Standard 
CONTENT ANALYSIS — A GUIDE 


1 SCOPE 


The procedural guidelines deal with digital content 
available in the English language. 


2 TERMINOLOGY 


2.1 Categorisation — Grouping of codified data on 
the basis of similarity. 


2.2 Coding — Reviewing and summarising key 
elements of content into a structured format. 


2.3 Content Analysis — It is a technique that 
comprises a series of procedures to draw valid 
inferences about the text selected for the study. 


2.4 Controlled Vocabulary — Controlled 
vocabularies act as a source of core concepts of the 
field. They also help in managing the issues associated 
with the existence of synonyms and other equivalent 
words in the text. A thesaurus is an example of a 
controlled vocabulary. 


2.5 Co-Occurrence — It denotes the existence of two 
or more terms from the content under consideration 
alongside each other. 


2.6 Correlation Coefficient — It is used to measure 
the strength of relationship between concepts. Its 
value ranges from -1 to +1: 

+1   the two words occur together 
 0   the two words are unrelated 
-1   the presence of one word in a phrase goes with 
     the absence of the other word, and vice versa 
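
The scale can be illustrated with a small sketch. It assumes that each phrase is reduced to the presence (1) or absence (0) of two terms and that the Pearson correlation of those indicators is an acceptable measure; this standard does not prescribe a formula, and the function name `term_correlation` and the sample phrases are hypothetical. The sketch follows the usual statistical convention in which +1 means the terms always appear together and -1 means one appears only where the other is absent.

```python
from math import sqrt

def term_correlation(phrases, term_a, term_b):
    """Pearson correlation between the 0/1 presence indicators of two terms."""
    xs = [1 if term_a in p.lower().split() else 0 for p in phrases]
    ys = [1 if term_b in p.lower().split() else 0 for p in phrases]
    n = len(phrases)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a term present in every phrase or in none gives no signal
    return cov / sqrt(var_x * var_y)

# Illustrative phrases only
phrases = [
    "semantic web content",
    "semantic web tools",
    "manual coding rules",
    "manual coding effort",
]
together = term_correlation(phrases, "semantic", "web")      # always co-occur
exclusive = term_correlation(phrases, "semantic", "manual")  # never co-occur
print(together, exclusive)
```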


2.7 Data — Meaningful text appropriate for content 
analysis. 


2.8 Data Driven — Decisions or progress in 
processes dependent entirely on data. 


2.9 Dictionary — It represents the pool of suitable 
categories defined by the analyst for content analysis. 


2.10 Interoperability — It is the ability of a system to 
work in conjunction with other system(s). 


2.11 Metadata — Data that describes the distinct 
facets of information to enhance its usage. 


2.12 Nodes — Terms/phrases which symbolise the 
core concepts of the content. 


2.13 Pattern Rule — Rule to define a pattern in the 
phrase. 


2.14 Percent Agreement — Used to present the 
percentage of association between varied results: 

$$\text{Percent agreement} = \frac{\text{difference of values}}{\text{average of two values}}$$ 

2.15 Relative Frequency — A statistical method 
in which a core of text that may be considered 
representative in some sense is defined first, and the 
frequency of each word in that core is identified. 
To obtain the relative frequency, a word's frequency 
in a specific text is then compared with the frequency 
of that word in the core of text. Usually, it is denoted as 
follows: 

$$\text{Relative frequency} = \frac{\text{Count in subgroup}}{\text{Total count}}$$ 
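
As a worked illustration of the ratio, the following minimal sketch assumes that the "subgroup" is one text, that the "core" is a small list of texts which includes it, and that words are separated by white space. The function name `relative_frequency` and the sample texts are hypothetical.

```python
from collections import Counter

def relative_frequency(word, subgroup_text, core_texts):
    """Count of `word` in one text divided by its count across the whole core."""
    word = word.lower()
    count_in_subgroup = Counter(subgroup_text.lower().split())[word]
    total_count = sum(Counter(t.lower().split())[word] for t in core_texts)
    if total_count == 0:
        return 0.0  # the word does not occur anywhere in the core
    return count_in_subgroup / total_count

# Illustrative core of three short texts
core = [
    "ontology maps ontology terms",
    "coding rules guide coding",
    "ontology supports coding",
]
rf = relative_frequency("ontology", core[0], core)  # 2 occurrences out of 3
print(round(rf, 3))
```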


2.16 Semantic Web — It is an extension of the 
current web where adding semantic description(s) to 
the web resources allows machines and humans to 
interpret, use and share the information in cooperation. 


2.17 Statistical Techniques — Procedures under 
this category may be referred to as elementary 
analysis techniques. These techniques deal with 
the identification of words, word groups, and their 
frequencies. 


2.18 Stemming — A statistical approach for correct 
identification of word frequency. It involves reducing 
variant word forms to a common meaningful base form, 
which may further be used in getting the word count. 
Stemming involves developing an algorithm to eliminate 
common suffixes/prefixes. 


2.19 Stop-list — The class of words which do not 
express the content. 
2.20 Successor Variety Method — The successor 
variety of a prefix is the number of different letters 
following it in the words of a corpus. 


For example: 

Test word: computer 

Corpus: compare, computation, compute, computer, computing 

Prefix      Successor Variety   Different Letters 
c           1                   o 
co          1                   m 
com         1                   p 
comp        2                   a, u 
compu       1                   t 
comput      3                   a, e, i 
compute     1                   r 
computer    0                   - 
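
The figures in the example can be reproduced with a short sketch. It assumes that only letters actually following the prefix are counted, so a corpus word that ends exactly at the prefix contributes nothing; the function name `successor_variety` is hypothetical.

```python
def successor_variety(prefix, corpus):
    """Return the sorted distinct letters that follow `prefix` in the corpus words."""
    return sorted({word[len(prefix)] for word in corpus
                   if word.startswith(prefix) and len(word) > len(prefix)})

corpus = ["compare", "computation", "compute", "computer", "computing"]
test_word = "computer"

table = {}
for i in range(1, len(test_word) + 1):
    prefix = test_word[:i]
    letters = successor_variety(prefix, corpus)
    table[prefix] = (len(letters), letters)
    print(f"{prefix:<10}{len(letters):<3}{', '.join(letters) or '-'}")
```

A sharp rise in successor variety, as at "comput", is commonly read as a likely boundary between stem and suffix.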


2.21 Syntactic Unit — Language unit such as word or 
phrase. 


2.22 Table Lookup — This involves storing a table of 
terms along with their root forms. Matching a query 
term against the terms present in the table yields the 
associated stem. 
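
A minimal sketch of table lookup follows. The table contents and the name `lookup_stem` are hypothetical, and returning the term unchanged when it is absent from the table is an assumption made for illustration.

```python
# Hypothetical lookup table mapping variant forms to their root form
STEM_TABLE = {
    "analysing": "analyse",
    "analysed": "analyse",
    "analyses": "analyse",
    "coded": "code",
    "coding": "code",
    "codes": "code",
}

def lookup_stem(term, table=STEM_TABLE):
    """Return the stored root form, or the term itself when it is not in the table."""
    term = term.lower()
    return table.get(term, term)

stems = [lookup_stem(t) for t in ["Coding", "analysed", "ontology"]]
print(stems)
```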


2.23 Theory Driven — Decisions or progress that 
involve human judgment. 


2.24 Word Frequencies — A statistical approach 
used to count words appearing in the text. It can be 
used as an approach for content analysis. The assumption 
behind this method is that the frequency of a word in 
a text assigns weight to the term in association with the 
text, i.e. the more frequently a word appears in a text, 
the more likely the text is to be about the concept it denotes. 
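
Counting word frequencies while ignoring the stop-list (2.19) might be sketched as below. The stop-list shown is a tiny illustrative set rather than a recommended one, and the name `word_frequencies` is hypothetical.

```python
import re
from collections import Counter

# Illustrative stop-list only; a real one would be far longer
STOP_LIST = {"the", "of", "and", "a", "is", "in", "to"}

def word_frequencies(text, stop_list=STOP_LIST):
    """Count alphabetic words in `text`, skipping words on the stop-list."""
    words = re.findall(r"[a-z]+", text.lower())
    return Counter(w for w in words if w not in stop_list)

text = "The analysis of content is the coding of content and the counting of content."
freq = word_frequencies(text)
top = freq.most_common(1)
print(top)
```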


3 PROCEDURES INVOLVED IN CONTENT 
ANALYSIS 


As there is no single right way to conduct content 
analysis, this document makes an effort 
to illustrate the steps involved in conducting 
content analysis. 


Each step is described below, followed by the manual 
approach and the automated approach to it. 


Step 1: Defining a Question 

Description: The following aspects need to be considered 
before framing the question: 

a) Who will be the users? 

b) What are we trying to achieve? (Purpose) 

c) What are others doing in the field? 

d) What will be the process? 

e) What platform or technology are we planning to use? 

Manual Approach: Limited application of information and 
wisdom; generally restricted to linear resources and to a 
single person or a small group of persons. 

Automated Approach: The spectrum is comparatively broad. 


Step 2: Determine the Representative Sample from the 
Population 

Description: Identifying and defining relevant and 
sufficient text to be analysed. Sampling may be either 
random or purposive, depending on the approach adopted 
for the process. 

Manual Approach: The sample population is small. 

Automated Approach: The sample population is 
comparatively larger. 


Step 3: Data Preparation 

Description: It involves addition, cleaning, and formatting 
of data. 

Manual Approach: Manual scanning of content and removal 
of the resources which are not associated with the 
research question. 

Automated Approach: The purpose is to process/encode the 
data in a form suitable for analysis, that is, to transform 
the data such that the information embedded in the content 
can be best exposed. 


Step 4: Categorisation Scheme and Defining the Rules for 
Coding 

Description: It involves either using/refining an existing 
categorisation scheme or developing a new scheme based on 
the purpose of the study. Further, defining coding rules 
helps in achieving consistency throughout the process, 
every time. This step helps to overcome the problems 
associated with homographs, synonyms, etc. 

Manual Approach: Formation is theory driven. Instead of a 
mathematical value, a subjective approach is preferred to 
establish association between the words/phrases. 

Automated Approach: The approach for development is data 
driven. While recognising words/phrases, it is presumed 
that they are indicative of the subject of the content. 
There should be rules to lay down the formal program in a 
manner that identifies and records the words/phrases from 
the text according to the defined approach, i.e. frequency 
basis, co-occurrence test, pattern rule, correlation 
coefficient. 


Step 5: Coding 

Description: It involves essence capturing and 
categorisation. A code is assigned to each syntactical unit 
on the basis of the categorisation scheme. Coding reduces 
the data to numerical values and helps in establishing 
links between the syntactical units, which can further 
assist in moving from data to idea. 

Manual Approach: Human coders perform the task manually by 
using a predefined categorisation scheme. 

Automated Approach: As the coding is performed by computer, 
it is free from inter-coder differences. Coding can be 
performed by using one of the following approaches: 

1) Content is categorised/coded on the basis of a 
predefined categorisation scheme such as an ontology. In 
such a case there should be a provision such that the 
application first consults the uploaded scheme to extract 
the exact match. In the absence of an exact match, there 
must be a facility to match the words according to 
prefixes and suffixes. Further, there should be a 
provision for a separate list of words/phrases which are 
present in the content but not in the scheme, for later 
examination. 

2) Content is categorised/coded on the basis of the 
frequency of occurrence of words in the content, which 
serves as the basis to describe the categories and to 
group different words into one lemma by using statistical 
procedures. 


Step 6: Checking the Coding Consistency 

Description: To establish validity and reliability, it is 
essential that whenever different people or systems code 
the same text, they always get the same results. 

Manual Approach: This depends greatly on the consistency 
and understanding of the coder with respect to the coding 
and classification scheme deployed. Instructions for the 
coder described in the categorisation scheme help in 
attaining reliability or consistency in the sense of 
reproducibility, i.e. getting the same coding results 
every time, even when the task is performed by different 
people. 

Automated Approach: Routine or repeated cycles of 
operations for calculating the desired results, and for 
revising the syntactical units, the codes assigned, the 
rules defined for searching, and the analysis program, 
for effective results. 


Step 7: Drawing Conclusion/Making Inferences 

Description: This involves analysing the outputs of the 
coding process. Results can be represented in the form of 
a hierarchical structure/tree diagram, graphs, etc. 

Manual Approach: The reasoning abilities of the intellect 
are used in making inferences from the coded data. Output 
is usually in the form of numbers, percentages, tables, 
and graphs. 

Automated Approach: Makes use of statistical techniques/ 
packages, a pre-defined order of text, and artificial 
intelligence to interpret the coded data. Results can be 
represented in the form of triples, as in RDF syntax, or 
as a network graphical map. 


Step 8: Reporting the Methods Adopted and Findings 

Description: A complete and fair explanation of the 
procedures adopted and the decisions taken during the 
process needs to be reported to support the replication 
of the study. While reporting, the level of description 
and interpretation needs attention. Further, the form and 
level of reporting depend upon the research question. 

Manual Approach: Linguistic as well as cognitive 
approaches are applied. 

Automated Approach: A cognitive approach that makes use of 
numerical and graphical methods is generally applied. 


Thus, in comparison to the manual or coder-based 
approach, content analysis when performed with the 
help of computer technology is free from inter-coder 
variation. Further, coding in an automated environment 
is replicable and reliable. 
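
The first automated coding approach described under the Coding step (exact match against an uploaded scheme, then a fallback on prefixes and suffixes, with unmatched words set aside for later examination) can be sketched as follows. The scheme, the suffix list, the minimum stem length of four letters and the function names are all hypothetical choices made for illustration; this standard does not prescribe an algorithm.

```python
# Hypothetical categorisation scheme: term -> category code
SCHEME = {
    "ontology": "KNOWLEDGE_ORGANISATION",
    "thesaurus": "KNOWLEDGE_ORGANISATION",
    "validator": "QUALITY_CONTROL",
}
SUFFIXES = ("ies", "es", "s", "ing", "ed")  # illustrative suffix list

def code_term(term, scheme=SCHEME, suffixes=SUFFIXES):
    """Return the code for `term`, or None when neither match succeeds."""
    term = term.lower()
    if term in scheme:                      # 1. exact match in the scheme
        return scheme[term]
    for suffix in suffixes:                 # 2. fallback: strip a known suffix
        if term.endswith(suffix):
            stem = term[: -len(suffix)]
            for entry, code in scheme.items():
                # accept a scheme entry that begins with the remaining stem
                if len(stem) >= 4 and entry.startswith(stem):
                    return code
    return None

def code_content(terms):
    """Code each term; collect terms absent from the scheme for later review."""
    coded, unmatched = {}, []
    for term in terms:
        code = code_term(term)
        if code is None:
            unmatched.append(term)
        else:
            coded[term] = code
    return coded, unmatched

coded, unmatched = code_content(["ontology", "ontologies", "validators", "metaphor"])
print(coded)
print(unmatched)
```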


4 AVAILABLE FORMATS 
Content on the web is evolving regularly in a variety of 
formats like: 

a) HTML, XML, Rich Text Format files (text); 

b) JPEG, PNG, GIF, PDF files (image); and 

c) MPEG, WMV, MP4 files (video). 


The rapid growth and variety of formats available for 
content on the web demand a method that can help 
both humans and machines to manage and understand 
the web content semantically. For this, web content is 
usually integrated with an ontology to add semantics, 
which can further help in performing content analysis, 
especially when the task is performed with the 
help of a computer. 


As far as analysis of content is concerned, there are 
two basic approaches, namely the quantitative and the 
qualitative approach. Quantitative content analysis 
can be performed by using various applications like 
Excel, SPSS, SAS, etc. For the qualitative approach, 
on the other hand, where semantics is the core, there 
is no single way to perform the task. 


The following section describes the steps for qualitative 
analysis: 


5 DATA FORMAT FOR LAYER OF INTELLIGENCE 

Though distinct data formats exist, the one 
that best suits the requirements of relational content 
analysis should have the following general features: 


a) Capable of expressing data without losing information; 


b) Able to utilise distinct tools meant to interpret data 
completely and correctly as far as possible; and 


c) Supports interoperability. 


Further, the data format adopted for content analysis 
should be such that, in addition to interpretation and 
representation of content, it also promotes 
sharing. Instead of imposing limitations, it should 
be flexible enough to permit modifications and 
additions in the procedure, whenever required. With 
such requirements in view, the World Wide Web Consortium 
(W3C) recognised the Resource Description Framework 
(RDF) as a standard to describe resources on the 
internet. RDF is commonly written in XML syntax, with triples 
(subject-predicate-object) as its fundamental unit. 


Thus, RDF can be used to analyse web content by 
coding it into a semantic graph with triples 
as the base. The graphical representation is capable 
of denoting the relations between the nodes, 
which symbolise the core concepts of the content. 
Moreover, working with RDF makes it possible to use 
existing tools for semantics, such as the English 
lexical database Princeton WordNet, whose RDF 
conversion is available at 
http://semanticweb.cs.vu.nl/lod/wn30/. Binding with such 
language-rich tools helps in conducting content analysis 
with ease. 


Further, RDF allows the addition of metadata, in the 
form of a Uniform Resource Identifier (URI), to describe 
the graphical representation; other statements can 
use the URI as their subject or to carry additional 
information about statements. 
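
The triple structure can be illustrated without any RDF library by holding each statement as a plain (subject, predicate, object) tuple. This is a sketch only: the `http://example.org/` namespace, the property names and the helper functions are hypothetical and are not part of any published vocabulary.

```python
EX = "http://example.org/"  # hypothetical namespace used only for this sketch

# Each statement is one (subject, predicate, object) triple
triples = [
    (EX + "ContentAnalysis", EX + "hasApproach", EX + "QuantitativeAnalysis"),
    (EX + "ContentAnalysis", EX + "hasApproach", EX + "QualitativeAnalysis"),
    (EX + "QualitativeAnalysis", EX + "focusesOn", EX + "Semantics"),
]

def objects_of(subject, predicate, graph):
    """Return every object linked to `subject` through `predicate`."""
    return [o for s, p, o in graph if s == subject and p == predicate]

def nodes(graph):
    """Return the set of nodes (subjects and objects) of the semantic graph."""
    return {s for s, _, _ in graph} | {o for _, _, o in graph}

approaches = objects_of(EX + "ContentAnalysis", EX + "hasApproach", triples)
print(approaches)
print(len(nodes(triples)))
```

The nodes of this small graph correspond to the core concepts, and each predicate is a labelled edge between two of them.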


6 DATA PREPARATION AND VALIDATION 


Content to be analysed needs to be encoded in some 
standardised machine-readable format like HTML, 
XML, etc. On the existing web, most webpages are 
developed using HTML syntax. Like a human language, 
a markup language has a syntax and a vocabulary. 
Similarly, there are grammar rules for markup 
languages that need to be followed while contributing 
to the web. To verify the correctness of the grammar, 
experts usually prefer tools called validators. 
'Markup Validation Service', 'XML Validator' and 
'RDF Validator' are some examples of validator 
services offered by W3C. 
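
A local check of the most basic grammar rule, well-formedness of XML, can be sketched with the Python standard library. This is only an assumption-laden illustration: it does not validate against a DTD or schema and is not a substitute for the validator services named above; the function name `is_well_formed` and the sample markup are hypothetical.

```python
import xml.etree.ElementTree as ET

def is_well_formed(markup):
    """Return True when `markup` parses as well-formed XML, False otherwise."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False

good = "<record><title>Content analysis</title></record>"
bad = "<record><title>Content analysis</record>"  # <title> is never closed

print(is_well_formed(good), is_well_formed(bad))
```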


7 CATEGORISATION SCHEME AND DEFINING THE RULES FOR CODING 

Due to the continuously growing and evolving nature 
of web content, it is difficult to describe valid 
expressive categories. In such a scenario, tools like 
ontology serve the purpose: in addition to 
semantic maps, they can offer interoperability, with which 
the currency of information can be maintained to some 
extent. Further, training of coders becomes essential 
while dealing with the analysis of web content. 


8 CODING 


8.1 Coding refers to tagging and deals mainly with 
organising large text into few content categories. It 
involves grouping and labeling of similar units by using 
the categories defined in categorisation scheme. The 
process offers an enriched semantic structure suitable 
for analysis and interpretation. To achieve reliability 
and consistency in content analysis process, there 
should be a systematic procedure to perform coding. 
Coding involves: 


8.1.1 Essence Capturing 


This is an ongoing process. It involves highlighting and 
collecting the core concepts that represent the essence 
of the text. It involves scanning the content selected for 
the analysis process that is suitable enough to represent 
the theme. The task can be performed manually, 
through close reading of the content, if the sample size is 
small. For machine-readable content, the capabilities of 
a computer can be utilised, which usually helps in 
defining the syntactical units on the basis of frequency 
of occurrence of concepts and by comparing the rank 
order of concepts. The computer-assisted approach can be 
used when a large amount of content needs to be analysed. 
After the initial phase, the syntactical units need to be 
reviewed, described, refined, and distilled, if required. 


8.1.2 Preprocessing 


The task is performed to simplify the coding process. 
In computer assisted approach, preprocessing is 
performed by the application selected for coding. 
Activities covered under preprocessing are as follows: 


a) To define the phrase boundary on the basis of 
punctuation. 


b) To describe a word as noun, verb, etc.: There may 
be distinct approaches to fix a word as noun, verb 
or any other category. Generally, a noun in the content 
is identified by considering the articles attached 
to the word or on the basis of capitalisation, 
whereas verbs are recognised by comparing 
the sentence with the phrases in a predefined 
dictionary. Secondly, a stop list in combination with 
the predefined vocabulary can be considered to 
define the nodes. 


c) Stemming techniques are used to reduce variant word 
forms to a common meaningful base form. Common 
approaches used for stemming are affix removal, 
successor variety, table lookup, etc. 


d) Syntactic parsing involves building a parse tree on the 
basis of structural relationships between the concepts 
in the phrase. This is to discover the meaning 
embedded in the sentence. 
Example: 
'Semantic analysis of web content', parsed by 
using the online tool available at the URL 
http://nlp.stanford.edu:8080/parser/index.jsp 


Parse 

(ROOT 
  (S 
    (NP 
      (NP (JJ semantic) (NN analysis)) 
      (PP (IN of) 
        (NP (NN web)))) 
    (ADJP (JJ content)))) 


where S-sentence; NP-noun phrase; JJ-adjective; 
NN-singular or mass noun; PP-prepositional phrase; 
IN-preposition; ADJP-adjective phrase. 
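
Three of the preprocessing activities listed in this clause (phrase boundaries from punctuation, use of a stop list, and stemming by affix removal) can be sketched together. The stop list, the suffix list and the rule of keeping at least three letters are hypothetical simplifications; the affix removal shown is deliberately crude and is not a published stemming algorithm.

```python
import re

STOP_LIST = {"the", "of", "a", "an", "is", "are", "and", "to", "in"}  # illustrative
SUFFIXES = ("ing", "ed", "s")                                         # illustrative

def split_phrases(text):
    """Split text into phrases at punctuation marks."""
    return [p.strip() for p in re.split(r"[.,;:!?]", text) if p.strip()]

def strip_suffix(word):
    """Crude affix removal: drop the first matching suffix, keeping 3+ letters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def preprocess(text):
    """Return, for each phrase, its stemmed words with stop-list words removed."""
    result = []
    for phrase in split_phrases(text):
        words = [w for w in phrase.lower().split() if w not in STOP_LIST]
        result.append([strip_suffix(w) for w in words])
    return result

units = preprocess("Coders tagged the documents, and validators checked the codes.")
print(units)
```

The stem "tagg" produced for "tagged" shows why plain suffix stripping is usually complemented by table lookup or a more careful stemming method.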


8.1.3 Categorisation 


This involves grouping of codes or syntactical units 
on the basis of similarity among them. Categorisation 
helps in defining the semantic patterns between the 
codes on the basis of coding schemes, which are used 
to define the relationships between the codes. 


Based on the approach adopted for coding, the task 
of content analysis can be categorised into two 
operations, that is, manual and computer-assisted. 
When the coding is performed by a group of human 
coders, it is termed the manual approach, whereas 
in the computer-assisted method, coding is 
performed automatically without human intervention. 
Both approaches have their own limitations: the 
manual approach demands time and effort, whereas 
in automatic analysis, coding is performed 
without considering various sources of complexity 
like ambiguity of homographs, metaphor usage, 
pronominal references, etc. Therefore, aspects like 
experience, cost, and time need to be considered while 
making a methodical decision about the suitable mode 
for content analysis. 


Considering the benefits and limitations associated 
with both methods, another approach that integrates 
human and computerised coding mechanisms is now 
preferred. Here, coding is performed 
by human coders using the capabilities of computer 
programs. 


The task of categorising web content into distinct 
classes and subclasses can be performed qualitatively 
by defining the skeletal vocabulary of the data format. 
Though every data set has its own vocabulary, 
defining the categories in a standardised manner ensures 
the reliability and validity of data interpretation. To 
present an overview of the RDF vocabulary, W3C has 
issued a specification whose details can be found 
at the URL https://www.w3.org/TR/2004/REC-rdf-schema-20040210/ 


Syntactic parse tree for the query 


RDF classes 

Class name        Comment 
rdf:XMLLiteral    The class of XML literal values. 
rdf:Property      The class of RDF properties. 
rdf:Bag           The class of unordered containers. 
rdf:Seq           The class of ordered containers. 
rdf:Alt           The class of containers of alternatives. 
rdf:List          The class of RDF Lists. 


RDF properties 

Property name    Comment                                              Domain          Range 
rdf:type         The subject is an instance of a class.               rdfs:Resource   rdfs:Class 
rdf:first        The first item in the subject RDF list.              rdf:List        rdfs:Resource 
rdf:rest         The rest of the subject RDF list after the           rdf:List        rdf:List 
                 first item. 
rdf:subject      The subject of the subject RDF statement.            rdf:Statement   rdfs:Resource 
rdf:predicate    The predicate of the subject RDF statement.          rdf:Statement   rdfs:Resource 
rdf:object       The object of the subject RDF statement.             rdf:Statement   rdfs:Resource 


9 CHECKING THE CODING CONSISTENCY 


Information on the web is expanding regularly. Due to this 
very nature of the internet, testing the reliability 
of coded data is a challenging task. 
Consistency in coding can be achieved by using coding 
guidelines like W3C web coding standards, etc. Coding 
instructions, training, and regular and routine validation 
of coded data can offer some relief in this direction. 
Inter-coder reliability can also be established by using 
distinct statistical approaches like percent agreement, 
correlation coefficient, etc. To ensure error-free 
syntax, validators created by distinct bodies, such as 
the various validator services of W3C, can be used: 


a) W3C Markup Validation Service — To verify 
the syntax of HTML and XHTML documents. 


b) XML Validator — To check the syntax of XML 
document. 


c) CSS Validation Service — To check Cascading 
Style Sheets (CSS) and XHTML documents etc. 
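
As an illustration of an inter-coder check, the sketch below computes the share of units to which two coders assigned the same code. This is the simple matching form of percent agreement that is widely used for inter-coder reliability; it is an assumption of this sketch and is not derived from the formula given in 2.14. The function name `percent_agreement` and the sample codes are hypothetical.

```python
def percent_agreement(codes_a, codes_b):
    """Percentage of units to which both coders assigned the same code."""
    if len(codes_a) != len(codes_b):
        raise ValueError("both coders must code the same units")
    if not codes_a:
        raise ValueError("at least one coded unit is required")
    matches = sum(a == b for a, b in zip(codes_a, codes_b))
    return 100.0 * matches / len(codes_a)

# Codes assigned by two coders to the same five units (illustrative data)
coder_1 = ["method", "tool", "tool", "format", "method"]
coder_2 = ["method", "tool", "format", "format", "method"]

agreement = percent_agreement(coder_1, coder_2)  # 4 of 5 units match
print(agreement)
```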


10 DRAWING INFERENCE 


To draw an inference means adding more to what is 
already known about a data set. In RDF, every statement is 
expressed as a triple and represents a fact in a meaningful 
sense. The understanding obtained can again be expressed 
in triples. The process can continue as far as is needed 
to draw a suitable conclusion. 
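
The idea of deriving new triples from known ones, and repeating until nothing new appears, can be sketched with two RDFS-style rules: subclass relations are transitive, and an instance of a class is also an instance of its superclasses. The `ex:` names and the function `infer` are hypothetical, and the sketch is not a complete RDFS reasoner.

```python
SUBCLASS = "rdfs:subClassOf"
TYPE = "rdf:type"

def infer(graph):
    """Apply the two rules repeatedly until no new triple is produced."""
    facts = set(graph)
    while True:
        new = set()
        for s, p, o in facts:
            for s2, p2, o2 in facts:
                if o != s2 or p2 != SUBCLASS:
                    continue
                if p == SUBCLASS:      # A subClassOf B, B subClassOf C
                    new.add((s, SUBCLASS, o2))
                elif p == TYPE:        # x type A, A subClassOf B
                    new.add((s, TYPE, o2))
        if new <= facts:               # nothing new: conclusion reached
            return facts
        facts |= new

# Illustrative starting facts
graph = {
    ("ex:Thesaurus", SUBCLASS, "ex:ControlledVocabulary"),
    ("ex:ControlledVocabulary", SUBCLASS, "ex:CategorisationScheme"),
    ("ex:MyThesaurus", TYPE, "ex:Thesaurus"),
}
closure = infer(graph)
print(len(closure))
```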


11 CONCLUSION 


A phrase of natural language has both a linear and a 
hierarchical structure. This property, if brought out by 
means of a tree or context-free phrase structure, 
adds to learning and understanding. 


Example: 
12 ABBREVIATIONS 


a) GIF - Graphics Interchange Format 

b) HTML - Hypertext Markup Language 

c) JPEG - Joint Photographic Experts Group 

d) MPEG - Moving Picture Experts Group 

e) PDF - Portable Document Format 

f) PNG - Portable Network Graphics 

g) RDF - Resource Description Framework 

h) SAS - Statistical Analysis System 

j) SPSS - Statistical Package for the Social Sciences 

k) WMV - Windows Media Video 

m) XHTML - Extensible Hypertext Markup Language 

n) XML - Extensible Markup Language 

p) W3C - World Wide Web Consortium 


Figure for the example in clause 11: phrase structure tree of the 
STATEMENT 'The library organized its books', with nodes including 
Verb Phrase and verb. 
Bureau of Indian Standards 


BIS is a statutory institution established under the Bureau of Indian Standards Act, 2016 to promote harmonious 
development of the activities of standardization, marking and quality certification of goods and attending to 
connected matters in the country. 


Copyright 


BIS has the copyright of all its publications. No part of these publications may be reproduced in any form without 
the prior permission in writing of BIS. This does not preclude the free use, in the course of implementing the 
standard, of necessary details, such as symbols and sizes, type or grade designations. Enquiries relating to 
copyright be addressed to the Director (Publications), BIS. 


Review of Indian Standards 


Amendments are issued to standards as the need arises on the basis of comments. Standards are also reviewed 
periodically; a standard along with amendments is reaffirmed when such review indicates that no changes are 
needed; if the review indicates that changes are needed, it is taken up for revision. Users of Indian Standards 
should ascertain that they are in possession of the latest amendments or edition by referring to the latest issue of 
'BIS Catalogue' and 'Standards: Monthly Additions'. 


This Indian Standard has been developed from Doc No.: MSD 05 (14594). 